feat: add Delta Lake support #89

Merged (1 commit, Dec 28, 2022)
Conversation

@mikix (Contributor) commented Dec 5, 2022

Description

Pass --output-format=deltalake to enable. And make sure your AWS Glue is set up to support it.

This is not enabled by default (yet) and this commit does not yet do anything clever with incremental bulk exports. But this is a beginning to build upon.

Some ugly parts of this PR:

  • Java. We are now (behind the scenes, via PySpark) downloading jars and running Java code when using Delta Lake support. There's a native Rust implementation with Python bindings, but it's not feature complete yet.
  • Delta Lake is VERY SLOW. Even with toy data. Which is annoying in prod and also annoying when running pytest.
  • Stdout suppression. PySpark is crazy noisy on the console, so I resorted to a bit of a hack to keep it quiet (see deltalake.py for my sins; a rough sketch of the trick appears below).

Relates to #75 and provides a solution for #76.
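
The stdout-suppression hack mentioned above is essentially the classic file-descriptor dance: the JVM that PySpark spawns writes to the raw descriptors, so redirecting sys.stdout alone doesn't help. A minimal sketch of the general technique (not the exact code in deltalake.py), consistent with the os.dup2 call visible in the diff below:

```python
import contextlib
import os


@contextlib.contextmanager
def quiet_fds():
    """Temporarily point stdout/stderr (fds 1 and 2) at /dev/null.

    Redirecting sys.stdout isn't enough, because the JVM that PySpark
    spawns writes directly to the raw file descriptors.
    """
    saved_out = os.dup(1)
    saved_err = os.dup(2)
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 1)
        os.dup2(devnull, 2)
        yield
    finally:
        os.dup2(saved_out, 1)
        os.dup2(saved_err, 2)
        os.close(devnull)
        os.close(saved_out)
        os.close(saved_err)
```

Wrapping the noisy Spark calls in `with quiet_fds(): ...` keeps the console clean while still restoring the original descriptors afterward.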

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mentioned this pull request on Dec 5, 2022
@mikix changed the title from "feat: add ACI data lake support" to "feat: add ACID data lake support" on Dec 12, 2022
@mikix force-pushed the mikix/acidlake branch 2 times, most recently from 56b1f62 to e015953 on Dec 21, 2022
@mikix changed the title from "feat: add ACID data lake support" to "feat: add Delta Lake support" on Dec 21, 2022
try:
    self.root.rm(parent_dir, recursive=True)
except FileNotFoundError:
    pass

try:
-   full_path = self.root.joinpath(f'{path}.{batch:03}.{self.suffix}')
+   full_path = self.root.joinpath(f'{dbname}/{dbname}.{batch:03}.{self.suffix}')
@mikix (author) commented on the diff:

This change right here renames our output files from condition/fhir_conditions.000.ndjson to condition/condition.000.ndjson. Not an important part of this PR, but it made it easier to handle all output tables the same way. It did cause a lot of file/test churn though, sorry.

os.dup2(stderr, 2)


class DeltaLakeFormat(AthenaFormat):
@mikix (author) commented on the diff:

I'm not proud of much of this class: from suppressing output, to the list of jars and versions, to converting fsspec S3 params to Hadoop S3 params. But it does seem to technically work...
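
For the fsspec-to-Hadoop piece, the translation amounts to a key rename. A sketch under the assumption that the inputs are the usual s3fs arguments (key, secret, token, client_kwargs.region_name) and the outputs are the standard Hadoop s3a config keys (not the PR's actual code):

```python
def fsspec_to_hadoop_s3(fsspec_options: dict) -> dict:
    """Translate common s3fs/fsspec arguments into Hadoop s3a config keys."""
    mapping = {
        "key": "fs.s3a.access.key",
        "secret": "fs.s3a.secret.key",
        "token": "fs.s3a.session.token",
    }
    hadoop = {}
    for fsspec_name, hadoop_name in mapping.items():
        if fsspec_options.get(fsspec_name):
            hadoop[hadoop_name] = fsspec_options[fsspec_name]
    region = fsspec_options.get("client_kwargs", {}).get("region_name")
    if region:
        # fs.s3a.endpoint.region needs a reasonably recent Hadoop (3.3.2+)
        hadoop["fs.s3a.endpoint.region"] = region
    return hadoop
```

Each resulting key would then be handed to the Spark session, e.g. via builder.config(f"spark.hadoop.{name}", value), which propagates it into the Hadoop configuration.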

A reviewing contributor commented:

How would you feel about separating the write and merge? Like, just have the ETL process create the data and push it to S3, and have some kind of service in AWS pick up those files and do the Delta Lake insertion?

@mikix (author) replied:

Sorry, can you explain the intent there? Separating out the gross Java-requiring code into another step of the process, or is there a fancy AWS thing you're thinking of that saves us overall code, or something else?

I'm leery of the control we'd be giving up there. Like, in the short term, I'm going to add support for the FHIR server telling us which resources have been deleted since we last exported. How might that look with what you're proposing?

The contributor replied:

Grosser - you'd have to set up some way to call an API or otherwise execute a query.

But doing the data lake ingest in AWS, you might be able to use a Glue job rather than worry about hand-managing it: https://aws.amazon.com/marketplace/pp/prodview-seypofzqhdueq - it might be easier overall?

@mikix (author) replied:

For this, I think I prefer hand-managing: more control over the flow (like deleting), less AWS dependency, and fewer moving parts / less infrastructure. And the cost feels light -- basically 20 lines of "upsert or create" (sketched below).
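
A hedged sketch of what that "upsert or create" core could look like using the delta-spark DeltaTable API (the function name, table path, and the `id` merge column are illustrative assumptions, not the PR's actual code):

```python
from delta.tables import DeltaTable
from pyspark.sql.utils import AnalysisException


def update_delta_table(spark, updates_df, path: str) -> None:
    """Upsert updates_df into the Delta table at `path`, creating it on first run."""
    try:
        # forPath raises AnalysisException if no Delta table exists here yet
        table = DeltaTable.forPath(spark, path)
        (
            table.alias("table")
            .merge(updates_df.alias("updates"), "table.id = updates.id")  # merge key assumed
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    except AnalysisException:
        # First write: no table yet, so just create one
        updates_df.write.format("delta").save(path)
```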

Are your trade-off sliders set differently, or were you musing aloud?

The contributor replied:

Mostly musing, especially given the frequency we're running jobs at.

-   python-version: ["3.7", "3.8", "3.9", "3.10", "3.11"]
+   python-version: ["3.7", "3.8", "3.9", "3.10"]
@mikix (author) commented on the diff:

OK, how do we feel about this? PySpark 3.4 officially supports py3.11, but delta-spark is still pinned to 3.3.

So... how much do we feel the need to support py3.11? Our Docker image right now is still on 3.10, so this doesn't affect our shipped artifacts yet.

The contributor replied:

I generally have no interest in being leading edge - but we should watch it.
